
#fix 35 #41

Merged
freelw merged 2 commits into main from wangli_dev_20250617_2 on Jun 17, 2025

Conversation

@freelw
Owner

@freelw freelw commented Jun 17, 2025

#fix 35

freelw added 2 commits June 17, 2025 12:21
@freelw
Owner Author

freelw commented Jun 17, 2025

./lm
corpus : ./resources/time_machine/timemachine_preprocessed.txt
epochs : 10
batch_size : 16
dropout : 0.2
gpu : 1
learning rate : 0.001
checkpoint :
max_words_cnt : 256
token_ids_size : 256
Allocating memory
for tensors : 36609236 bytes,
for c_tensors: 3194706336 bytes
for grad_tensors: 1241779004 bytes
epoch 0 : [192/224]loss : 5.84092
epoch 1 : [192/224]loss : 2.01121
epoch 2 : [32/224]loss : 0.76783
checkpoint saved : ./checkpoints/checkpoint_20250617_122712_3.bin

@freelw freelw requested a review from dratman June 17, 2025 04:28
@freelw
Owner Author

freelw commented Jun 17, 2025

The issue has been fixed. The cause was that a signed int was incorrectly used instead of an unsigned int when calculating the offset of the Metal buffer. I also reset the default batch_size to 16; with this setting the program runs correctly, but it triggers swap on my MacBook and makes the machine run a bit slowly. It's therefore generally better to reduce it, for example with the -b 8 configuration. batch_size can also be set slightly larger, but not by much, because buffer offsets beyond the range of a uint are not supported. You can try the latest code from the main branch and check whether the loss decreases with the default parameters on your MacBook. @dratman

@freelw freelw merged commit eac9698 into main Jun 17, 2025
1 check passed
@freelw
Owner Author

freelw commented Jun 17, 2025

fix #35

@freelw
Owner Author

freelw commented Jun 17, 2025

(base) ➜ cpp-transformer git:(main) ./lm
corpus : ./resources/time_machine/timemachine_preprocessed.txt
epochs : 10
batch_size : 16
dropout : 0.2
gpu : 1
learning rate : 0.001
checkpoint :
max_words_cnt : 256
token_ids_size : 256
Allocating memory
for tensors : 36609236 bytes,
for c_tensors: 3194706336 bytes
for grad_tensors: 1241779004 bytes
epoch 0 : [192/224]loss : 6.02557
epoch 1 : [192/224]loss : 2.02572
epoch 2 : [192/224]loss : 0.380724
epoch 3 : [192/224]loss : 0.0781031
epoch 4 : [192/224]loss : 0.0326921
epoch 5 : [192/224]loss : 0.0226658
epoch 6 : [192/224]loss : 0.019125
epoch 7 : [192/224]loss : 0.0176224
epoch 8 : [192/224]loss : 0.0167923
epoch 9 : [192/224]loss : 0.0157845
checkpoint saved : ./checkpoints/checkpoint_20250617_125929_9.bin

@dratman
Collaborator

dratman commented Jun 17, 2025

Fortunately my MacBook has 64 GByte shared RAM so not likely to be a problem. Will continue testing in the morning.

@freelw
Owner Author

freelw commented Jun 17, 2025

> Fortunately my MacBook has 64 GByte shared RAM so not likely to be a problem. Will continue testing in the morning.

I'm so envious of you. By the way, should I support video memory exceeding 4GB? Hahahaha

@dratman
Collaborator

dratman commented Jun 18, 2025 via email

@freelw
Owner Author

freelw commented Jun 18, 2025

@dratman
I am amazed that you still maintain such a passion for learning; it is truly admirable!

First of all, I wish you good health.

May I contact you directly via the email published on your GitHub profile? My email is freelw81@qq.com.

Regarding the issue mentioned above: the output did not change significantly when the prompt was modified. I think there may be two reasons:

  1. The training data we used contains only 256 words, which may cause the model's output to lack diversity. You can use the parameter -m 10000000 to let the program train on the complete Time Machine data. On my machine, one epoch then takes 4 hours.
  2. The positional encoding I used is absolute positional encoding, which should differ from standard GPT-2. I am still learning about this area and plan to try other positional encodings in future versions.

@dratman
Collaborator

dratman commented Jun 18, 2025 via email

@freelw
Owner Author

freelw commented Jun 19, 2025

> Of course, feel free to send email to my regular address. --- By "the full Time Machine data" you mean the 178 KByte novel in the file timemachine.txt? I assumed I was already training with that. To train with the whole novel, I just add "-m 10000000" to the training command? 4 hours per epoch is no problem. I can easily let it run overnight or even for several days.

If you add a parameter like -m 10000000, the output should contain lines like the following:

./lm -m 100000000
corpus : ./resources/time_machine/timemachine_preprocessed.txt
epochs : 10
batch_size : 16
dropout : 0.2
gpu : 1
learning rate : 0.001
checkpoint : 
max_words_cnt : 100000000
token_ids_size : 32775
Allocating memory  
for tensors : 36609236 bytes, 
for c_tensors: 3194706336 bytes 
for grad_tensors: 1241779004 bytes
epoch 0 :  [80/32743]loss : 7.45601

Note that the denominator on the "epoch" progress line is 32743 and that token_ids_size is 32775. Only when you see values like these can you be sure that the full text of the novel is being used for training.
@dratman

@freelw
Owner Author

freelw commented Jun 19, 2025

You're really amazing! My project was also inspired by two of Andrej Karpathy's projects: llm.c and micrograd. I see you've already looked at llm.c. It's indeed a remarkable project, but I don't think it's particularly suitable for understanding deep learning. I highly recommend micrograd; its implementation is simple. If you have a basic grasp of the backpropagation mechanism, that project might suddenly make things click for you, just as it did for me. It's extremely inspiring: simple to implement yet perfect for learning.

> About my continued interest in this field: I spent two years as an undergraduate at UC Berkeley in 1969-1971, concentrating on physics and math. Later I found a career in both logic design and software development. Many years and events went by. My wife and I raised two children but lost one to a drug overdose. Over time I began to feel old, and my ability to learn new technical material was actually declining until about 2022, when I first found out about the astonishing architecture of the GPT-type language models. The high-dimensional vectors I read about in connection with GPT-3, successively modified through dozens of processing layers, were unlike anything I could have imagined as a way of representing word, sub-word or character tokens. The idea of changing an integer representing an English letter, word, or a Chinese character into a thousand-dimensional vector of floating-point numbers seemed to defy common sense. I was immediately determined to understand what was going on. I started reading extensively and watching YouTube videos, and gradually -- to my surprise -- some of my technical acumen returned. I began playing with Andrej Karpathy's makemore and similar small models.
>
> I am still trying to fully grasp how it is possible that this bizarre system of vectors and weights can understand what I write, and then reply in ways that often help me understand some topic more quickly than by the old methods of study.

